Boiler: lossy compression of RNA-seq alignments using coverage vectors
Abstract
We describe Boiler, a new software tool for compressing and querying large collections of RNA-seq alignments. Boiler discards most per-read data, keeping only a genomic coverage vector plus a few empirical distributions summarizing the alignments. Since most per-read data is discarded, the storage footprint is often much smaller than that achieved by other compression tools. Despite this, the most relevant per-read data can be recovered; we show that Boiler compression has only a slight negative impact on results given by downstream tools for isoform assembly and quantification. Boiler also allows the user to pose fast and useful queries without decompressing the entire file. Boiler is free, open-source software available from github.com/jpritt/boiler.
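As a rough illustration of the idea in the abstract, the sketch below reduces a set of aligned reads to a per-base coverage vector plus an empirical read-length distribution, then run-length encodes the coverage. It is not Boiler's actual implementation or file format: the function names, the half-open `(start, end)` read representation, and the choice of run-length encoding are all assumptions made for this example.

```python
from collections import Counter


def coverage_vector(reads, genome_length):
    """Per-base read depth for half-open (start, end) intervals (hypothetical input format)."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    # A running sum of the difference array gives the depth at each position.
    coverage = []
    depth = 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def summarize(reads, genome_length):
    """Keep only aggregate information; the individual reads are discarded."""
    coverage = coverage_vector(reads, genome_length)
    return {
        "coverage_rle": run_length_encode(coverage),
        "read_lengths": Counter(end - start for start, end in reads),
    }


# Toy example: four reads on a 10-base reference.
reads = [(0, 4), (2, 6), (2, 6), (8, 10)]
summary = summarize(reads, 10)
```

The summary no longer records which read started where, which is what makes this kind of scheme lossy: many different read sets share the same coverage and length distribution, so decompression can only recover a plausible set of reads rather than the exact originals.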
Similar resources
ChIPWig: a random access-enabling lossless and lossy compression method for ChIP-seq data
Motivation: Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are inexpensive and time-efficient, and result in massive datasets that introduce significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. ChIPWig enables random access, su...
Fast Indexing of Lattice Vectors for Image Compression
Visual communication is becoming increasingly important, with applications in areas such as multimedia, communication, data transmission and storage of remote sensing images, satellite images, education, and medicine. Image data occupies a large amount of space. Meeting bandwidth requirements while maintaining acceptable image quality is a challenge. Hence image compression is requ...
RSeQC: quality control of RNA-seq experiments
Motivation: RNA-seq has been extensively used for transcriptome study. Quality control (QC) is critical to ensure that RNA-seq data are of high quality and suitable for subsequent analyses. However, QC is a time-consuming and complex task, due to the massive size and versatile nature of RNA-seq data. Therefore, a convenient and comprehensive QC tool to assess RNA-seq quality is sorely needed. ...
Rail-RNA: scalable analysis of RNA-seq splicing and coverage
Motivation: RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it requires extra work to obtain analysis products that incorporate data from across samples. Results: We describe Rail-RNA, a cloud-enabl...
Working with aligned nucleotides (WORK-IN-PROGRESS!)
This vignette belongs to the GenomicAlignments package. It illustrates how to use the package for working with the nucleotide content of aligned reads. After the reads generated by a high-throughput sequencing experiment have been aligned to a reference genome, the questions that are being asked about these alignments typically fall in two broad categories: positional only and nucleotide-related...